    Improving prefetching mechanisms for tiled CMP platforms

    Recently, high-performance processor designs have evolved toward Chip-Multiprocessor (CMP) architectures to deal with the limits of instruction-level parallelism and, more importantly, to manage power consumption, which is becoming unaffordable due to increased transistor counts and clock frequencies. At present, this architecture, which implements multiple processing cores on a single die, is commercially available with up to twenty-four processors on a single chip, and roadmaps and research trends suggest that the number of cores will increase in the near future. The growing number of cores has turned the interconnection network into a key issue that will have a significant impact on performance. Moreover, as the number of cores increases, tiled architectures are foreseen to provide a scalable solution to handle design complexity. Network-on-Chip (NoC) emerges as a solution to deal with growing on-chip wire delays. On the other hand, CMP designs are likely to be equipped with latency-hiding techniques such as prefetching in order to reduce the negative impact on performance that high cache miss rates would otherwise cause. Unfortunately, the extra network messages that prefetching entails can drastically increase power consumption and latency in the NoC. In this thesis, we do not develop a new prefetching technique for CMPs but propose improvements applicable to any of them. Specifically, we analyze the behavior of prefetching in CMPs and its impact on the interconnect. We propose several dynamic management techniques to improve the performance of the prefetching mechanism in the system. Furthermore, we identify the main problems that arise when implementing prefetching in distributed memory systems such as tiled architectures and propose directions to solve them. Finally, we propose several research lines to continue the work done in this thesis. Postprint (published version)

    Improving the prefetching performance through code region profiling

    In this work, we propose a new technique to improve the performance of hardware data prefetching. The technique is based on detecting periods of time and regions of code in which the prefetcher is not working properly, and therefore provides no speedup or even causes a slowdown. Once these periods and regions are detected, the prefetcher may be switched off and later switched back on. To implement such a mechanism efficiently, we identify three orthogonal issues that must be addressed: the granularity of the code region, when the prefetcher is switched on, and when the prefetcher is switched off.
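
The three issues named in this abstract (region granularity, switch-off condition, switch-on condition) can be illustrated with a minimal software sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the class and parameter names, the choice of prefetch accuracy as the switch-off criterion, and the fixed re-enable period are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RegionStats:
    enabled: bool = True    # is the prefetcher currently on for this region?
    issued: int = 0         # prefetches issued in the current window
    useful: int = 0         # prefetches in the window that were later demanded
    off_accesses: int = 0   # accesses observed while the prefetcher is off


class RegionPrefetchController:
    """Hypothetical per-code-region on/off control for a prefetcher."""

    def __init__(self, region_bits=8, window=16, min_accuracy=0.25, off_period=32):
        self.region_bits = region_bits    # granularity: 2**region_bits bytes of code
        self.window = window              # prefetches per evaluation window
        self.min_accuracy = min_accuracy  # switch-off threshold (assumed metric)
        self.off_period = off_period      # accesses to wait before switching on again
        self.regions = {}

    def _stats(self, pc):
        # Issue 1: granularity. A region is identified by the upper bits of the PC.
        return self.regions.setdefault(pc >> self.region_bits, RegionStats())

    def should_prefetch(self, pc):
        stats = self._stats(pc)
        if not stats.enabled:
            stats.off_accesses += 1
            # Issue 2: switch on again after a fixed number of accesses,
            # so the region gets re-sampled.
            if stats.off_accesses >= self.off_period:
                stats.enabled = True
                stats.issued = stats.useful = stats.off_accesses = 0
        return stats.enabled

    def record_prefetch(self, pc, was_useful):
        stats = self._stats(pc)
        stats.issued += 1
        stats.useful += int(was_useful)
        if stats.issued >= self.window:
            # Issue 3: switch off when accuracy over the window is too low.
            if stats.useful / stats.issued < self.min_accuracy:
                stats.enabled = False
                stats.off_accesses = 0
            stats.issued = stats.useful = 0
```

In this sketch a simulator would call `should_prefetch(pc)` before issuing a prefetch and `record_prefetch(pc, was_useful)` once the outcome of a prefetch is known; a hardware implementation would need small per-region counters instead of a dictionary.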

    Comparative Study of Prefetching Mechanisms

    No full text

    Network aware performance evaluation of prefetching techniques in CMPs

    No full text
    This study focuses on the importance of quantifying the effect of prefetching on the interconnection network of a multiprocessor chip. Microarchitectural effects of this kind are often quantified using simulators. However, if prefetching traffic in a CMP (Chip MultiProcessor) system is to be evaluated accurately, simulators should model not only the memory hierarchy and the multicore system, but also the network-on-chip. Unfortunately, no open-source simulator is able to evaluate all these elements at the same time. This paper describes how to develop a prefetching module for the gem5 CMP simulator and how to integrate it into the Ruby memory system. Moreover, using the infrastructure developed in this study, the paper shows that prefetching-related studies must take the network effect into account to obtain accurate results: not doing so may lead to mistaken conclusions. For this purpose, we have carried out a detailed analysis of the behavior of three different prefetching engines, providing not only the typical statistics for instructions per cycle and miss rate, but also specific network and prefetching statistics. Peer Reviewed
